Real-world multimedia data is often composed of multiple modalities, such as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulation, where a subset of these modalities can be altered to misrepresent or repurpose the data package, with possible malicious intent. It is, therefore, important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper, we present a novel deep learning-based approach for assessing the semantic integrity of multimedia packages containing images and captions, using a reference set of multimedia packages. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we make available to the research community. We use the newly created dataset as well as the Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. The reference dataset does not contain unmanipulated versions of tampered query packages. Our method is able to achieve F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO, respectively, for detecting semantically incoherent media packages.